```r
data |>
  select(-iso2c, -iso3c, -year) |>
  visdat::vis_miss() +
  ggtitle("Missing World Bank Data for 2020")
```
Missing data is extremely common in applied research. How should we deal with it?
The default is simply dropping observations with missing values from analyses (listwise deletion), but does this make sense?
```
Call:
lm(formula = libdem_score ~ gini, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56930 -0.16300  0.07276  0.21980  0.39146 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.774470   0.169130   4.579 2.26e-05 ***
gini        -0.006482   0.004760  -1.362    0.178    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2604 on 63 degrees of freedom
  (114 observations deleted due to missingness)
Multiple R-squared:  0.0286, Adjusted R-squared:  0.01318 
F-statistic: 1.855 on 1 and 63 DF,  p-value: 0.1781
```
Our approach to missing data will depend on our theory of why it exists.
MCAR means there is genuinely no “pattern” to the missingness.
Only a random subset of countries are studied in each “wave”
The power went out during a telephone survey
You were distracted while coding and made some errors
MAR means there is a pattern, but it's based on something you can observe and predict
In our global development data: Gini coefficients are generally not collected for wealthy countries
In a study of congressional twitter accounts: older members never sign up for the service
In a study of social movement participation: members of vulnerable groups decline to answer a question about protest
All of these cases suggest that we can fully account for the characteristic that causes missing data: we have data on wealth in the global development data, we have data on members' ages in the congress study, and we can identify who belongs to vulnerable groups.
MNAR means that there is something systematic and unobserved that causes the missingness. This is especially tricky if the missingness depends on expected values of the missing data itself.
In the development data: if the World Bank doesn't collect data on highly unequal countries
In a study of protest effectiveness: smaller events rarely generate news coverage
In a study of the effect of education on income, people with low income expectations never enter the labor force
We don't really have even partial data on the cause of missingness here: we might suspect that "low potential income" causes someone to decide not to enter the labor force, but we never observe wages for that person in the first place, so this is potentially untestable.
MCAR: No bias. Missingness just adds “noise”
MAR: Potentially biased estimates, but also may be fixable
MNAR: Potentially biased estimates, but harder to address
Simply dropping observations with missing data can work as long as the data are missing completely at random.
Standard errors will be larger than they would be if you had all those observations, but the slopes are unbiased in a linear model.
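To see why, here is a quick simulation (hypothetical data, not the World Bank data used elsewhere in these notes): deleting half of the predictor values at random barely moves the slope, but widens the standard error.

```r
# Under MCAR, listwise deletion leaves the slope unbiased but inflates the SE
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- 2 * x + rnorm(n)          # true slope is 2

x_mcar <- x
x_mcar[sample(n, n / 2)] <- NA  # delete half the predictor values at random

full    <- lm(y ~ x)
dropped <- lm(y ~ x_mcar)       # lm() applies listwise deletion by default

coef(dropped)["x_mcar"]         # still close to the true slope of 2
```

The dropped-data model's standard error is roughly 40% larger than the full model's, but both slope estimates sit near 2.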
Set the missing variable to 0 and include an indicator for missingness
```r
data$gini_missing <- is.na(data$gini)
data$gini_zeroes  <- ifelse(is.na(data$gini), 0, data$gini)

lm(libdem_score ~ gini_zeroes + gini_missing, data = data) |>
  summary()
```
```
Call:
lm(formula = libdem_score ~ gini_zeroes + gini_missing, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56930 -0.19196 -0.02777  0.20530  0.52354 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.774470   0.153152   5.057 1.06e-06 ***
gini_zeroes      -0.006482   0.004310  -1.504  0.13438    
gini_missingTRUE -0.462005   0.154736  -2.986  0.00323 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2358 on 176 degrees of freedom
Multiple R-squared:  0.1989, Adjusted R-squared:  0.1898 
F-statistic: 21.86 on 2 and 176 DF,  p-value: 3.328e-09
```
We could set missing values to their means, or try using a regression model to predict the missing predictors from other data.
This would produce unbiased estimates in the MCAR or MAR case if we had the right model.
But it would seriously underestimate the uncertainty in our data, because we fail to incorporate the uncertainty from the imputations themselves.
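The simplest version, single mean imputation, is a one-liner. A sketch on a toy data frame (the real analysis would use the gini column of the World Bank `data` frame):

```r
# single mean imputation: replace each missing value with the observed mean
d <- data.frame(gini = c(30, NA, 50))
d$gini_mean <- ifelse(is.na(d$gini), mean(d$gini, na.rm = TRUE), d$gini)
d$gini_mean  # 30 40 50
```

Note that every imputed row gets exactly the same value, which is why this shrinks the apparent variance of the predictor and understates uncertainty.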
```r
imputation_model <- lm(gini ~ gdp_pcap + pop_over_65 + libdem_score, data = data)
data$gini_imputed <- predict(imputation_model, newdata = data)

lm(libdem_score ~ gini_imputed, data = data) |>
  summary()
```
```
Call:
lm(formula = libdem_score ~ gini_imputed, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50675 -0.21558  0.02165  0.21800  0.50440 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.146892   0.164428   6.975 6.71e-11 ***
gini_imputed -0.019556   0.004322  -4.525 1.14e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2464 on 168 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.1086, Adjusted R-squared:  0.1033 
F-statistic: 20.48 on 1 and 168 DF,  p-value: 1.138e-05
```
One approach is to assume nothing and just ask “how bad could it get?”
|  | min | observed | max |
|---|---|---|---|
| gini | 0.011 | -0.006 | -0.012 |
|  | [0.005, 0.017] | [-0.016, 0.003] | [-0.015, -0.008] |
| Num.Obs. | 179 | 65 | 179 |
| R2 Adj. | 0.073 | 0.013 | 0.187 |
| BIC | 28.3 | 20.0 | 4.9 |
Advantage: minimal assumptions (especially for limited DVs, which come with natural extreme bounds)
Disadvantage: might overstate uncertainty, and the worst-case bounds would be far too pessimistic if the data are actually MCAR
This really works best when you have a very small number of missing observations. If you can show that your results are similar even under the worst case scenario assumptions, then you’re in a pretty good spot.
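A sketch of how bounding models like those in the table above could be fit, shown on a toy data frame so it runs standalone (with the real data you would fill the missing gini values instead):

```r
# worst-case bounding: fill missing predictor values with the observed
# minimum and maximum, then fit a model under each extreme
d <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 5, 8))
d$x_min <- ifelse(is.na(d$x), min(d$x, na.rm = TRUE), d$x)
d$x_max <- ifelse(is.na(d$x), max(d$x, na.rm = TRUE), d$x)

lo <- lm(y ~ x_min, data = d)
hi <- lm(y ~ x_max, data = d)

# the two slopes bracket what any fill-in of the missing value could produce
c(coef(lo)["x_min"], coef(hi)["x_max"])
```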
Multiple Imputation Through Chained Equations
1. Replace all missing data with a placeholder (like the column mean)
2. Reset one column (call it X) to its observed values
3. Use a model to impute X
4. Replace X with the predicted values from step 3, then move on to the next column
5. Repeat for all columns with missing data
6. Do all of these steps multiple times to create multiple versions of the data
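With the mice package, the whole procedure is a few lines. A sketch, assuming mice is installed and using the column names from the running example:

```r
library(mice)

# five imputed datasets via chained equations (default method is
# predictive mean matching)
imputations <- mice(data, m = 5, printFlag = FALSE, seed = 123)

# fit the model on each imputed dataset, then pool with Rubin's rules
fits <- with(imputations, lm(libdem_score ~ gini))
pool(fits) |> summary()
```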
Finally, we fit the model on each imputed dataset and pool the results to account for the uncertainty in our imputed estimates:
| term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|
| (Intercept) | 0.605 | 0.191 | 3.16 | 13.6 | 0.00722 |
| gini | -0.00557 | 0.00505 | -1.1 | 14.1 | 0.288 |
In practice, we would probably want to include more information here. If we have things like data from a previous wave of analysis, we could vastly improve our predictions and limit the amount of uncertainty we have to deal with.
MICE only works for MCAR or MAR cases. In other words: if we can’t create a regression model that can “explain” the missingness, then it doesn’t solve anything
No free lunch: we need at least some data about each row.
More missing data = more uncertainty.
What if people with a certain value of the outcome are never observed? For example: people with low expected earnings may drop out of the labor force entirely. This would cause an underestimate of the impact of education on wages because the lowest earners are never actually part of the equation. Moreover, imputation for these cases would be pretty dubious: we’re already trying to create a regression model to predict the outcome!
```r
set.seed(500)
N <- 1000
educ_year <- rpois(N, 15)
earnings  <- educ_year * 10 + rnorm(N, 0, 50)
lfp <- ifelse(earnings < quantile(earnings, .20), 0, 1)
lfp <- factor(lfp, labels = c('not in labor force', 'in labor force'))
df <- data.frame(educ_year, earnings, lfp)

df_comb <- bind_rows(
  df |> mutate(group = "true model"),
  df |> filter(lfp == "in labor force") |> mutate(group = 'observed only')
)

true_model     <- lm(earnings ~ educ_year, data = df)
observed_model <- lm(earnings ~ educ_year, data = df[which(lfp == "in labor force"), ])
```
```r
ggplot(df_comb, aes(x = educ_year, y = earnings)) +
  geom_point(aes(fill = lfp), alpha = .5, shape = 21, color = 'black') +
  geom_smooth(method = 'lm', col = 'orange', se = FALSE) +
  facet_wrap(~group) +
  xlab("education") +
  ylab('earnings') +
  ggtitle("hypothetical wage and education results") +
  theme_minimal() +
  scale_fill_manual(values = c("white", "black"))
```

The inverse Mills ratio is a linear transformation of the inverse selection probabilities, so IMR values will be higher for observations that more closely resemble the missing observations.
```r
# stage 1: predict the missingness
stage_1 <- glm(lfp == "in labor force" ~ educ_year,
               data = df,
               family = binomial(link = 'probit'))

# the inverse Mills ratio of the selection probability
first_stage_lp <- predict(stage_1)
df$imr <- dnorm(first_stage_lp) / pnorm(first_stage_lp)

# stage 2: regression on the observed cases, with the IMR as a control
stage_2 <- lm(earnings ~ 1 + educ_year + imr,
              data = df[which(lfp == "in labor force"), ])
```

The corrected coefficients now resemble the real model (although we still need to correct the standard errors).
```r
modelsummary(list("actual model"            = true_model,
                  "complete case model"     = observed_model,
                  "heckman corrected model" = stage_2))
```

|  | actual model | complete case model | heckman corrected model |
|---|---|---|---|
| (Intercept) | 9.267 | 68.450 | 29.252 |
|  | (6.063) | (5.962) | (21.542) |
| educ_year | 9.453 | 6.594 | 8.489 |
|  | (0.390) | (0.367) | (1.066) |
| imr |  |  | 32.313 |
|  |  |  | (17.067) |
| Num.Obs. | 1000 | 800 | 800 |
| R2 | 0.370 | 0.288 | 0.291 |
| R2 Adj. | 0.370 | 0.287 | 0.289 |
| AIC | 10564.0 | 8092.0 | 8090.4 |
| BIC | 10578.7 | 8106.1 | 8109.2 |
| Log.Lik. | -5279.006 | -4043.012 | -4041.217 |
| F | 586.460 | 322.149 | 163.388 |
| RMSE | 47.47 | 37.90 | 37.81 |
The sampleSelection package can estimate this in a single step, and will automatically produce corrected standard errors.
```r
library(sampleSelection)

model <- selection(
  # stage 1 model
  lfp == "in labor force" ~ educ_year,
  # stage 2 model
  earnings ~ educ_year,
  method = '2step', data = df
)

# showing only the coefficients and SEs in a formatted table
coef(summary(model), part = 'outcome') |>
  huxtable() |>
  add_rownames(colname = 'variable') |>
  add_colnames("")
```

| variable | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 29.3 | 22.5 | 1.3 | 0.195 |
| educ_year | 8.49 | 1.12 | 7.57 | 8.2e-14 |
| invMillsRatio | 32.3 | 17.5 | 1.84 | 0.0657 |
For MCAR (truly missing at random):
Normal listwise deletion is fine! The real question is “how do you know this is the situation?”
Descriptive statistics can help: MCAR data should be roughly balanced on observed characteristics.
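One simple balance check, sketched on simulated data (with the real data you would compare observed covariates across the missingness indicator):

```r
# compare an observed covariate across missing vs. non-missing rows;
# under MCAR the two groups should look similar
set.seed(42)
d <- data.frame(gdp = rnorm(200, 10), gini = rnorm(200, 35))
d$gini[sample(200, 50)] <- NA       # missingness unrelated to gdp (MCAR)

t.test(gdp ~ is.na(gini), data = d)  # expect no meaningful difference
```

A large, significant difference here would be evidence against MCAR; a null result is consistent with MCAR but cannot prove it, since unobserved drivers of missingness remain possible.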
For MAR (missing conditional on observed data):
Normal listwise deletion still works provided you can condition on the thing that causes data to be missing.
Imputation is probably harmless for these cases
For MNAR (missing conditional on unobserved variables):
Missing predictors: impute missing observations.
Missing outcomes: use a two stage model
For all missing data: less is better. If you can show the problem is minimal even under extreme assumptions, then there’s a lot less reason for concern.